The three datasets were joined, which left us with a dataset of 8,057 observations of 23 variables, one observation for each of California's 8,057 census tracts.
As the social and economic variables come from a census, it can be expected that not all variables are available for each census tract. I will explore what data is missing when looking at the individual variables.
Census data
Population
The population of the tracts varies widely, from 0 to 39,454 people. The mean and median are 4,769 and 4,528 respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3417 4528 4769 5832 39454

As the distribution is long-tailed, I zoomed in on the data so that the plot shows only 99% of the values. Most tracts have a population between 3,000 and 6,000 people. According to the Census Bureau, the optimum size of a tract is 4,000 people. 45 tracts have a population of zero.
## [1] 45
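The zoom and the zero count can be reproduced along these lines. This is a minimal sketch in base R on a made-up vector; the name `population` and its values are placeholders, not the actual data.

```r
# Toy population counts (made-up values, not the census data)
population <- c(0, 0, 3400, 4500, 4800, 5800, 39000)

# Upper bound that covers 99% of the values
upper <- quantile(population, probs = 0.99)

# Keep only the values at or below that bound for plotting
zoomed <- population[population <= upper]

# Number of tracts with zero population
n_zero <- sum(population == 0)
```

Cutting at a quantile only changes what the plot shows; the summary statistics are still computed on all values.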
Zero population
Looking into this further, I found that California has 21 water-only tracts. These can be identified by their ids, which are in the 9900s. In addition, tracts with an id in the 9800s are special land-use census tracts, such as large parks or employment areas with little or no residential population (source). I added a variable to mark these special tracts.
Variable added:
- special (values: water, special_land or regular)
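A sketch of how such a marker could be derived from the tract id. It assumes the id is stored as an 11-character GEOID string (2-digit state, 3-digit county, 6-digit tract code) with its leading zero intact; the function name `classify_tract` and the example ids are hypothetical.

```r
# Classify tracts by the first four digits of the 6-digit tract code.
# Assumption: geoid is a character vector of 11-digit GEOIDs.
classify_tract <- function(geoid) {
  geoid <- as.character(geoid)
  # Characters 6 to 9 hold the first four digits of the tract code
  base <- as.integer(substr(geoid, 6, 9))
  ifelse(base >= 9900, "water",
         ifelse(base >= 9800, "special_land", "regular"))
}

# Hypothetical example ids
ids <- c("06037980001", "06073990100", "06075010100")
special <- classify_tract(ids)
```

If the ids were read in as numbers, the leading zero of the state code is lost and the character positions shift, so the ids would have to be padded back to 11 characters first.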

As can be seen above, most of the tracts that have zero population are indeed either water-only or special land-use tracts. However, three tracts, 2 in LA and 1 in San Diego, have no special indication. I looked up the tracts on this website, where you can see the tract on a map. The tract in San Diego is water, one of the LA tracts is the area where Universal Studios is located and the other tract seems to be a business park. In all three cases it seems the tract id should be in either the 9900 or the 9800 range.
I used the tigris library to find all counties and tracts in California and their latitudes and longitudes. Please note that information is available for only 8,043 tracts; 14 water-only tracts are not included in the tigris library. I assume they lie outside the state land boundaries. To plot the information I used this tutorial.

As can be clearly seen, Lake Tahoe is made up of water-only tracts. The northern islands of the Channel Islands, which are a national park, are clearly marked as special land-use tracts. Most special tracts lie in Los Angeles County. Not all tracts are visible, either because they are small or because they lie beyond the state land borders (for water-only tracts).
I decided to remove these 45 tracts due to their zero population, which results in a dataset of 8012 tracts (+ 24 variables).
## [1] 8012 24
Square miles
The size of the tracts is very diverse. The smallest tract represents a 2 by 2 block area in San Francisco, while the largest tracts are close to 6,950 square miles (one tract contains Death Valley NP and the other Mojave National Preserve). 75% of the tracts have a size of less than 1.787 square miles (3rd quartile).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.022 0.396 0.729 19.405 1.787 6951.837

Even when the axis is transformed (log10), the distribution still has a long tail. 86% of the tracts are smaller than 5 square miles. Smaller tracts are more common than large tracts.
## [1] 85.84623
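The share of tracts below a given size and the log10 transform can be computed as follows. This is a sketch on made-up areas; `square_miles` follows the variable name used in this report, the values are placeholders.

```r
# Toy tract areas in square miles (made-up values)
square_miles <- c(0.02, 0.4, 0.7, 1.8, 4.9, 12, 6950)

# Percentage of tracts smaller than 5 square miles:
# the mean of a logical vector is the proportion of TRUE values
pct_under_5 <- mean(square_miles < 5) * 100

# log10 transform, as used for the plot axis
log_area <- log10(square_miles)
```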
Population per square mile
As the distributions of both population and square miles are very diverse, it is unsurprising that the population per square mile also varies widely, from 0.37 to 173,337 people per square mile, with a median of 6,316 and a mean of 8,607. A figure like 173,337 people per square mile seems very high, but can be explained by small, densely populated tracts (a couple of blocks) in cities.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.37 2694.90 6316.00 8607.28 11020.46 173336.58

The distribution is a mirror of the square miles distribution. Most tracts have a population density between 2,500 and 10,000 people per square mile.
Households
Households can be split into two groups: households that own their house and households that rent. The average home ownership for all tracts is 54.2%, the median is 56.9%. Data is not available for 28 tracts.
Variable added:
- owner (percentage of households that own their house)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 36.46 56.86 54.17 73.65 100.00 28

The percentage of home ownership is roughly linear up to 70%, after which it drops. Interestingly, there are more tracts where all households are renters (0% owner) than tracts where everybody is an owner. This makes sense: it is unlikely that everyone in a particular tract owns their house, whereas, especially in cities, it is not as unlikely that everybody rents.
Household size
The median and mean household size is 3 persons (rounded). Data is not available for 30 tracts.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.020 2.530 2.970 3.037 3.490 9.750 30

When we look at the histogram, it seems that very few tracts (61) have an average household size of more than 5 members.
## [1] 61
## owner_hhsize renter_hhsize
## Min. : 0.000 Min. :0.000
## 1st Qu.: 2.530 1st Qu.:2.460
## Median : 2.950 Median :3.020
## Mean : 3.042 Mean :3.088
## 3rd Qu.: 3.490 3rd Qu.:3.660
## Max. :11.910 Max. :8.840
## NA's :30 NA's :30
Renter households are slightly larger than owner households, but the range is larger for owner households.

When looking at the distributions of both renter (blue) and owner (green) household sizes, we see that the owner distribution is taller and peaks around 3 people, while the renter distribution has a less distinct peak. There are more renter households with 4 or 5 people, which is somewhat surprising: you would expect people who own their house to be older, financially able to buy a house and perhaps to have children living at home, more so than the generally younger renters. However, with the current housing situation in parts of California (like the San Francisco Bay Area), this general rule might not hold, as more people share housing with non-family members.
The peak at 0 occurs when a tract really has zero renter or owner households, or has so few that the variable was not computed by the Census Bureau.
Participation rate
The participation rate is clustered around the mean and median, which both lie around 63-64%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 59.20 64.10 63.38 68.83 100.00

The distribution looks normal. It is interesting that a number of tracts have a participation rate of 0% or 100%. This is puzzling, as participation rate had no missing values for the 53 tracts that were missing many other (census) variables.
0% & 100%

I noticed that almost all tracts that have a participation rate of 100% are special land tracts. The tract in San Diego is the airport, so this should probably also be a special_land tract.
I looked up some of the 0% tracts here and found that often only a prison or detention center was located in these tracts (regardless of whether they were marked special_land or not). I also noticed that the population of some of these tracts is mainly male. I therefore added a variable for the male percentage of the population to this subset.

Only two tracts have a lower male percentage than expected. The tract with 50% is a special area and is a national park. The second lowest is a tract where only a VA medical center is located. All other tracts seem to be tracts with either prisons or detention centers.
The tracts for which the median income was missing seemed to overlap with the tracts that have a 0 or 100% participation rate as can be seen in the plot below that shows missing values for this subset.

I therefore decided to drop the tracts for which the median income is missing, which leaves us with a dataset of 7,959 tracts (+ 25 variables).
## [1] 7959 25
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.40 59.20 64.10 63.59 68.90 100.00
The statistics have not changed much, except that the minimum is no longer 0 and the 100% participation rates now occur only for special_land tracts in parks.
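Dropping the tracts without a median income can be sketched as below. The data frame and its values are made up; only the column names `median_income` and `participation_rate` come from this report.

```r
# Toy data: two tracts lack a median income (made-up values)
tracts <- data.frame(
  id = c("t1", "t2", "t3", "t4"),
  median_income = c(52000, NA, 87000, NA),
  participation_rate = c(64.1, 0, 59.2, 100)
)

# Keep only the rows where median_income is present
kept <- tracts[!is.na(tracts$median_income), ]

# Check the new dimensions and the summary of the remaining values
dim(kept)
summary(kept$participation_rate)
```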
Unemployment rate
The unemployment rate ranges from 0 to 60.5%. The mean unemployment rate is 10.16% and the median 9.30%. There is a very long tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.60 9.30 10.16 12.80 60.50

When only taking into account values that lie in the 99% range, we still see a long tail, but most tracts have an unemployment rate of less than 16%.
Poverty rate
The poverty rate ranges from 0 to 91.8%. The mean is 16.49% and the median is 13.30%. Also here there is a long tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.30 13.30 16.49 23.20 91.80

Even when looking at only 95% of the values there is still a long tail. The most common poverty rates lie between 5 and 10%. The long tail is not surprising, as higher poverty rates are less likely.
Broadband data
Download speed
Note that the broadband data had to be summarized: the median was calculated per tract, as the distribution at block level was very long-tailed. The median advertised download speed ranges between 2 and 40 Mbps. There is one missing value.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 15.00 15.00 15.94 15.00 40.00 1
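The per-tract summarization could look like this in base R. This is a sketch on made-up block-level records; the column names `tract` and `down` are hypothetical, `median_down` is the variable name used in this report.

```r
# Toy block-level records (hypothetical tract ids, speeds in Mbps)
blocks <- data.frame(
  tract = c("A", "A", "A", "B", "B"),
  down  = c(3, 15, 100, 10, 20)
)

# Median advertised download speed per tract
by_tract <- aggregate(down ~ tract, data = blocks, FUN = median)
names(by_tract)[2] <- "median_down"
```

In tract A the single 100 Mbps block pulls the mean up to about 39 Mbps while the median stays at 15, which is why the median is the safer summary for a long-tailed distribution.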

The 1st quartile, median and 3rd quartile are all 15 Mbps. This is also clear from the histogram, which also shows peaks at 12, 25 and 30 Mbps.
Upload speed
Median advertised upload speed ranges from 1 to 20 Mbps. The median and mean are 2 and 2.8 Mbps respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 2.000 2.822 3.000 20.000 1

Only 384 tracts (4.8%) have a median upload speed higher than 3 Mbps.
## [1] 384
Number of providers
Overall, tracts have on average ~7 providers (also the median).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.000 6.000 7.000 7.232 8.000 13.000 1

50% of the tracts have between 6 and 8 providers (IQR).
Voting data
Voting participation
It seemed interesting to look at the voting participation, calculated as the total votes cast divided by the population per tract.
Variable added:
- voting_part (total votes divided by population)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.3523 27.6676 38.4196 39.3654 50.1880 340.5895 31
The median and mean are ~39%, meaning that about 39% of the population per tract voted. Data for 31 tracts is missing.
As the population also includes non-eligible voters (children etc.), we do not expect to see values of 100% or more. However, 11 tracts have a voting participation of more than 100%, and more tracts might be affected. I can think of two reasons why this is happening:
- Votes might have been assigned incorrectly when mapping from precincts to blocks. I have gone over this a couple of times and I cannot find where it goes wrong in my mapping, so I am not sure if there is something in the original mapping files or my method.
- Population and total votes numbers come from different datasets, so maybe numbers are assigned to the wrong tracts.
## [1] 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3523 27.6467 38.3857 39.1951 50.1410 90.0239
Removing these 11 tracts does not change the median and mean much.
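The calculation and the check for implausible values can be sketched as follows. The data is made up; `population`, `total_vote` and `voting_part` are the variable names used in this report, and expressing the ratio as a percentage is inferred from the summary above.

```r
# Toy data (made-up values); the last tract has no voting data
votes <- data.frame(
  population = c(4000, 5000, 1200, 3000),
  total_vote = c(1600, 2500, 1500, NA)
)

# Voting participation as a percentage of the population
votes$voting_part <- votes$total_vote / votes$population * 100

# Number of tracts with more votes than inhabitants
n_over_100 <- sum(votes$voting_part > 100, na.rm = TRUE)

# Tracts with a plausible, non-missing value
plausible <- votes[!is.na(votes$voting_part) & votes$voting_part <= 100, ]
summary(plausible$voting_part)
```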

The distribution looks relatively normal, though without a clear peak. 75% of the tracts have a voting participation of about 50% or less.
Missing values
Data for 31 tracts is missing.

Plotting these tracts on a map shows that the whole of Imperial County is missing (the white tract was already dropped before), plus one tract in LA. The source database notes that the data is not yet available for these tracts.
All census and broadband variables are available for these tracts, so I am not dropping them. When looking at voting data these tracts will not be included.
Democrats vs Republicans
The added variable ‘winner’ shows which party got the most votes. In almost 6,500 tracts the Democratic party got the most votes. That is 80% of the tracts. (Tracts without a winner are in Imperial & LA, see above.)

## [1] 80.34929
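One way such a variable and share could be derived is sketched below. It assumes the winner is simply the larger of the `dem` and `rep` percentages; the labels "dem" and "rep" and the toy values are placeholders, and ties are not treated specially.

```r
# Toy vote percentages (made-up values); the last tract has no data
votes <- data.frame(
  dem = c(64.3, 41.0, 76.0, NA),
  rep = c(25.8, 52.8, 15.4, NA)
)

# Party with the larger share; NA where voting data is missing
votes$winner <- ifelse(votes$dem > votes$rep, "dem", "rep")

# Share of Democratic tracts among all tracts
share_all <- sum(votes$winner == "dem", na.rm = TRUE) / nrow(votes) * 100

# Share of Democratic tracts among tracts with voting data
share_with_data <- mean(votes$winner == "dem", na.rm = TRUE) * 100
```

The two shares differ only by the tracts without voting data, so it is worth stating which denominator a reported percentage uses.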
## dem rep
## Min. :13.38 Min. : 0.00
## 1st Qu.:49.15 1st Qu.:15.36
## Median :64.34 Median :25.83
## Mean :62.13 Mean :28.59
## 3rd Qu.:76.00 3rd Qu.:40.82
## Max. :92.70 Max. :81.90
## NA's :31 NA's :31
For all tracts the median and mean percentage votes for Democrats are 64.34 and 62.13% respectively. The median and mean are 25.83 and 28.59% for the Republicans. The Democratic party is more popular than the Republican party in the Californian tracts.

The histograms show the distribution for the tracts where the party was the largest. Please note that a party does not need 50% of the votes to be the largest, as multiple parties participated. It is clear that when the Democrats won, they mostly did so with a higher percentage of the votes: on average 68% vs 54% for the Republicans. The Republican distribution peaks before 50%, while the Democratic one only peaks at 70% and is more level. The maximum percentage of votes is also lower for the Republicans, by roughly 11 percentage points (81.9% vs 92.7%).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.78 58.31 69.37 68.38 78.30 92.70
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41.33 48.46 52.83 54.18 58.53 81.90
Summary
Structure
Altogether, 45 tracts (for having no population) plus 53 tracts (for having no median income value) have been dropped, which leaves us with data for 7,959 tracts (+ 26 variables).
## [1] 7959 26

The dropped tracts are shown on the map. As can be seen, only a couple of tracts per county were dropped, except in LA, but LA has many tracts. We also know that these dropped tracts are either uninhabited, water, or special areas (national parks, business parks or jails/detention centers). In addition to their special character, the multiple missing variables for these tracts justify leaving them out of the exploration.
There are 26 variables:
- Tract info - 4 (Id, county, county_name, special)
- Census variables - 14 (population, total_hh, owner_hh, renter_hh, total_hhsize, owner_hhsize, renter_hhsize, owner, median_income, participation_rate, unemployment_rate, poverty_rate, square_miles, pop_sqmiles)
- Broadband variables - 3 (median_down, median_up, nr_providers)
- Voting variables - 5 (total_vote, dem, rep, winner, voting_part)
When plotting the tracts on a map, 6 additional variables (including latitude and longitude coordinates) are needed.
Interesting variables
I am mostly interested in how population per square mile interacts with the other variables, and in how the median income and the winner (Democrats/Republicans) relate to other variables. I am also interested in whether certain tracts have less or more access to broadband.
The first exploration has shown that the census tracts in California are very diverse in size, income and political color, but also in other social and economic variables. It will be interesting to see if there are differences between poor and rich tracts, Democratic and Republican tracts, and sparsely and densely populated tracts.
New variables
Five variables were created:
- pop_sqmiles (population divided by square miles)
- winner (categorical variable showing party with most votes)
- special (specify tract as water_only, special_land or regular)
- voting_part (total votes divided by population)
- owner (% of households that own their housing)